In this project report, we analyse a housing price data-set covering real estate sales in King County in the US state of Washington. The data-set contains unknown trends, and several attributes are suspected to influence the housing prices linearly. In this project, we try to identify these predictor variables and use them to estimate house prices.
#install.packages("caret")
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ✔ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(corrplot)
## corrplot 0.92 loaded
library(lubridate)
## Loading required package: timechange
##
## Attaching package: 'lubridate'
##
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(caTools)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
The dataset consists of house prices from King County, an area in the US state of Washington that also covers Seattle. The dataset was obtained from Kaggle. Prediction of property prices is becoming increasingly important and beneficial. Given the data provided, we are wrangling a large set of property sales records stored in an unknown format and with unknown data-quality issues.
Columns: date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zipcode, lat, long, sqft_living15, sqft_lot15
data <- read.csv(file = 'house_prediction_data.csv')
head(data)
print(paste("Number of records: ", nrow(data)))
## [1] "Number of records: 21597"
print(paste("Number of features: ", ncol(data)))
## [1] "Number of features: 21"
summary(data)
## id date price bedrooms
## Min. :1.000e+06 Length:21597 Min. : 78000 Min. : 1.000
## 1st Qu.:2.123e+09 Class :character 1st Qu.: 322000 1st Qu.: 3.000
## Median :3.905e+09 Mode :character Median : 450000 Median : 3.000
## Mean :4.580e+09 Mean : 540297 Mean : 3.373
## 3rd Qu.:7.309e+09 3rd Qu.: 645000 3rd Qu.: 4.000
## Max. :9.900e+09 Max. :7700000 Max. :33.000
## bathrooms sqft_living sqft_lot floors
## Min. :0.500 Min. : 370 Min. : 520 Min. :1.000
## 1st Qu.:1.750 1st Qu.: 1430 1st Qu.: 5040 1st Qu.:1.000
## Median :2.250 Median : 1910 Median : 7618 Median :1.500
## Mean :2.116 Mean : 2080 Mean : 15099 Mean :1.494
## 3rd Qu.:2.500 3rd Qu.: 2550 3rd Qu.: 10685 3rd Qu.:2.000
## Max. :8.000 Max. :13540 Max. :1651359 Max. :3.500
## waterfront view condition grade
## Min. :0.000000 Min. :0.0000 Min. :1.00 Min. : 3.000
## 1st Qu.:0.000000 1st Qu.:0.0000 1st Qu.:3.00 1st Qu.: 7.000
## Median :0.000000 Median :0.0000 Median :3.00 Median : 7.000
## Mean :0.007547 Mean :0.2343 Mean :3.41 Mean : 7.658
## 3rd Qu.:0.000000 3rd Qu.:0.0000 3rd Qu.:4.00 3rd Qu.: 8.000
## Max. :1.000000 Max. :4.0000 Max. :5.00 Max. :13.000
## sqft_above sqft_basement yr_built yr_renovated
## Min. : 370 Min. : 0.0 Min. :1900 Min. : 0.00
## 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951 1st Qu.: 0.00
## Median :1560 Median : 0.0 Median :1975 Median : 0.00
## Mean :1789 Mean : 291.7 Mean :1971 Mean : 84.46
## 3rd Qu.:2210 3rd Qu.: 560.0 3rd Qu.:1997 3rd Qu.: 0.00
## Max. :9410 Max. :4820.0 Max. :2015 Max. :2015.00
## zipcode lat long sqft_living15
## Min. :98001 Min. :47.16 Min. :-122.5 Min. : 399
## 1st Qu.:98033 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.:1490
## Median :98065 Median :47.57 Median :-122.2 Median :1840
## Mean :98078 Mean :47.56 Mean :-122.2 Mean :1987
## 3rd Qu.:98118 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.:2360
## Max. :98199 Max. :47.78 Max. :-121.3 Max. :6210
## sqft_lot15
## Min. : 651
## 1st Qu.: 5100
## Median : 7620
## Mean : 12758
## 3rd Qu.: 10083
## Max. :871200
When we look at house prices, we believe that price depends strongly on the age of the house and on whether it has been renovated. Hence, we will add these two pieces of information to our dataset.
saleDate = mdy( data$date )
data$saleDateYr = as.integer(year(saleDate))
data$age = data$saleDateYr - data$yr_built
data$reno = ifelse( data$yr_renovated==0,0,1 )
data$reno = as.numeric( data$reno )
We want to select the few columns that we believe should be the independent variables for our model. Thus, we will create a separate dataframe containing only these columns.
maindf <- data[, c( "bedrooms", "bathrooms", "sqft_living", "view", "grade", "age", "waterfront", "long", "lat", "zipcode",
"condition", "sqft_above", "sqft_living15", "reno", "price", "sqft_lot", "floors" ) ]
head(maindf,10)
Checking null values: let's check whether we have any null values that will need to be weeded out as part of our data cleaning.
sum(is.na(maindf))
## [1] 0
set.seed(123) # set seed to ensure you always have same random numbers generated
trainIndex <- createDataPartition(maindf$price, p=0.8, list=FALSE)
train_data <- maindf[ trainIndex,]
test_data <- maindf[-trainIndex,]
round(cor(train_data), digits = 2 )
## bedrooms bathrooms sqft_living view grade age waterfront long
## bedrooms 1.00 0.53 0.59 0.08 0.36 -0.16 0.00 0.13
## bathrooms 0.53 1.00 0.76 0.19 0.66 -0.50 0.06 0.22
## sqft_living 0.59 0.76 1.00 0.28 0.76 -0.31 0.10 0.24
## view 0.08 0.19 0.28 1.00 0.25 0.06 0.41 -0.08
## grade 0.36 0.66 0.76 0.25 1.00 -0.45 0.08 0.19
## age -0.16 -0.50 -0.31 0.06 -0.45 1.00 0.03 -0.41
## waterfront 0.00 0.06 0.10 0.41 0.08 0.03 1.00 -0.04
## long 0.13 0.22 0.24 -0.08 0.19 -0.41 -0.04 1.00
## lat -0.01 0.02 0.05 0.00 0.11 0.15 -0.01 -0.13
## zipcode -0.16 -0.20 -0.20 0.08 -0.19 0.35 0.03 -0.56
## condition 0.02 -0.13 -0.06 0.04 -0.14 0.36 0.01 -0.10
## sqft_above 0.49 0.69 0.88 0.17 0.76 -0.42 0.07 0.34
## sqft_living15 0.40 0.57 0.76 0.28 0.71 -0.32 0.08 0.33
## reno 0.01 0.05 0.05 0.11 0.01 0.23 0.10 -0.07
## price 0.32 0.53 0.70 0.40 0.67 -0.05 0.27 0.02
## sqft_lot 0.04 0.08 0.16 0.07 0.10 -0.04 0.02 0.22
## floors 0.18 0.50 0.36 0.04 0.46 -0.49 0.02 0.12
## lat zipcode condition sqft_above sqft_living15 reno price
## bedrooms -0.01 -0.16 0.02 0.49 0.40 0.01 0.32
## bathrooms 0.02 -0.20 -0.13 0.69 0.57 0.05 0.53
## sqft_living 0.05 -0.20 -0.06 0.88 0.76 0.05 0.70
## view 0.00 0.08 0.04 0.17 0.28 0.11 0.40
## grade 0.11 -0.19 -0.14 0.76 0.71 0.01 0.67
## age 0.15 0.35 0.36 -0.42 -0.32 0.23 -0.05
## waterfront -0.01 0.03 0.01 0.07 0.08 0.10 0.27
## long -0.13 -0.56 -0.10 0.34 0.33 -0.07 0.02
## lat 1.00 0.27 -0.02 0.00 0.05 0.03 0.31
## zipcode 0.27 1.00 0.00 -0.26 -0.28 0.07 -0.05
## condition -0.02 0.00 1.00 -0.16 -0.09 -0.06 0.04
## sqft_above 0.00 -0.26 -0.16 1.00 0.73 0.02 0.60
## sqft_living15 0.05 -0.28 -0.09 0.73 1.00 -0.01 0.59
## reno 0.03 0.07 -0.06 0.02 -0.01 1.00 0.12
## price 0.31 -0.05 0.04 0.60 0.59 0.12 1.00
## sqft_lot -0.09 -0.13 0.00 0.17 0.14 0.01 0.08
## floors 0.05 -0.05 -0.26 0.53 0.28 0.01 0.26
## sqft_lot floors
## bedrooms 0.04 0.18
## bathrooms 0.08 0.50
## sqft_living 0.16 0.36
## view 0.07 0.04
## grade 0.10 0.46
## age -0.04 -0.49
## waterfront 0.02 0.02
## long 0.22 0.12
## lat -0.09 0.05
## zipcode -0.13 -0.05
## condition 0.00 -0.26
## sqft_above 0.17 0.53
## sqft_living15 0.14 0.28
## reno 0.01 0.01
## price 0.08 0.26
## sqft_lot 1.00 -0.01
## floors -0.01 1.00
#install.packages("ggcorrplot")
library(ggcorrplot)
corr <- round(cor(train_data), 1)
# Plot
ggcorrplot(corr,
type = "lower",
lab = TRUE,
lab_size = 5,
colors = c("tomato2", "white", "springgreen3"),
title="Correlogram of Housing Dataset",
ggtheme=theme_bw)
According to our correlation plot, price is positively correlated with the bedrooms, bathrooms, sqft_living, view, grade, waterfront, sqft_above, lat, sqft_living15 and reno columns.
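The same positively correlated columns can also be read off programmatically rather than from the plot; a minimal sketch, assuming `train_data` from the chunk above is still in scope:

```r
# Extract the "price" column of the correlation matrix and rank the
# remaining features by their correlation with price
price_cor <- cor(train_data)[, "price"]
sort(price_cor[names(price_cor) != "price"], decreasing = TRUE)
```

This gives a quick numeric ranking to cross-check against the correlogram.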
Let’s plot a few scatter plots of the selected columns to view their distributions and their relationships with price.
p1=ggplot(data = train_data, aes(x = bedrooms, y = price)) +
geom_jitter() + geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Bedrooms and Price", x="bedrooms", y="Price")
p2=ggplot(data = train_data, aes(x = bathrooms, y = price)) +
geom_jitter() + geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Bathrooms and Price", x="bathrooms", y="Price")
p3=ggplot(data = train_data, aes(x = sqft_living, y = price)) +
geom_jitter() + geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Sqft_living and Price", x="sqft_living", y="Price")
p4=ggplot(data = train_data, aes(x = age, y = price)) +
geom_jitter() + geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Age and Price", x="age", y="Price")
p5=ggplot(data = train_data, aes(x = floors, y = price)) +
geom_jitter() + geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Floors and Price", x="floors", y="Price")
p6=ggplot(data = train_data, aes(x = reno, y = price)) +
geom_jitter() + geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Renovation and Price", x="reno", y="Price")
p7=ggplot(data = train_data, aes(x = view, y = price)) +
geom_jitter() + geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of View and Price", x="view", y="Price")
p8=ggplot(data = train_data, aes(x = grade, y = price)) +
geom_jitter() + geom_smooth(method = "lm", formula = y~x, se = FALSE)+labs(title="Scatter plot of Grade and Price", x="grade", y="Price")
grid.arrange( p1, p2, p3, p4, p5, p6, p7, p8, nrow = 4 )
We use density plots on a few example columns to view the distribution and type of the data.
library(e1071)
par(mfrow=c(2, 3))
plot(density(train_data$bedrooms), main="Density Plot: Bedrooms", ylab="Frequency",
sub=paste("Skewness:", round(e1071::skewness(train_data$bedrooms), 2)))
polygon(density(train_data$bedrooms), col="green")
plot(density(train_data$sqft_living), main="Density Plot: sqft_living", ylab="Frequency",
sub=paste("Skewness:", round(e1071::skewness(train_data$sqft_living), 2)))
polygon(density(train_data$sqft_living), col="orange")
plot(density(train_data$age), main="Density Plot: age", ylab="Frequency",
sub=paste("Skewness:", round(e1071::skewness(train_data$age), 2)))
polygon(density(train_data$age), col="green")
plot(density(train_data$reno), main="Density Plot: reno", ylab="Frequency",
sub=paste("Skewness:", round(e1071::skewness(train_data$reno), 2)))
polygon(density(train_data$reno), col="orange")
plot(density(train_data$floors), main="Density Plot: floors", ylab="Frequency",
sub=paste("Skewness:", round(e1071::skewness(train_data$floors), 2)))
polygon(density(train_data$floors), col="green")
plot(density(train_data$view), main="Density Plot: view", ylab="Frequency",
sub=paste("Skewness:", round(e1071::skewness(train_data$view), 2)))
polygon(density(train_data$view), col="orange")
We use boxplots to view possible outliers in a few example columns.
par(mfrow=c(2, 3)) # divide graph area in 2 columns
boxplot(train_data$bedrooms, main="Bedrooms")
boxplot(train_data$sqft_living, main="sqft_living")
boxplot(train_data$grade, main="grade")
boxplot(train_data$bathrooms, main="bathrooms")
boxplot(train_data$view, main="view")
boxplot(train_data$age, main="age")
The boxplots above show a significantly large number of outliers. Outliers can strongly affect our predictive models; to understand their impact, we will plot scatter plots with and without them.
Using the price column, we remove outliers from the dataset to obtain more accurate results.
outliers = boxplot(train_data$price,plot=FALSE)$out
outliers_data =train_data[which(train_data$price %in% outliers),]
train_data1 = train_data[-which(train_data$price %in% outliers),]
par(mfrow=c(1, 2))
plot(train_data$bedrooms, train_data$price, main="With Outliers", xlab="bedrooms", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ bedrooms, data=train_data), col="blue", lwd=3, lty=2)
# Plot of original data without outliers. Note the change of slope.
plot(train_data1$bedrooms, train_data1$price, main="Outliers removed", xlab="bedrooms", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ bedrooms, data=train_data1), col="blue", lwd=3, lty=2)
par(mfrow=c(1, 2))
plot(train_data$sqft_living, train_data$price, main="With Outliers", xlab="sqft_living", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ sqft_living, data=train_data), col="blue", lwd=3, lty=2)
# Plot of original data without outliers. Note the change of slope.
plot(train_data1$sqft_living, train_data1$price, main="Outliers removed", xlab="sqft_living", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ sqft_living, data=train_data1), col="blue", lwd=3, lty=2)
par(mfrow=c(1, 2))
plot(train_data$view, train_data$price, main="With Outliers", xlab="view", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ view, data=train_data), col="blue", lwd=3, lty=2)
# Plot of original data without outliers. Note the change of slope.
plot(train_data1$view, train_data1$price, main="Outliers removed", xlab="view", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ view, data=train_data1), col="blue", lwd=3, lty=2)
par(mfrow=c(1, 2))
plot(train_data$grade, train_data$price, main="With Outliers", xlab="grade", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ grade, data=train_data), col="blue", lwd=3, lty=2)
# Plot of original data without outliers. Note the change of slope.
plot(train_data1$grade, train_data1$price, main="Outliers removed", xlab="grade", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ grade, data=train_data1), col="blue", lwd=3, lty=2)
par(mfrow=c(1, 2))
plot(train_data$bathrooms, train_data$price, main="With Outliers", xlab="bathrooms", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ bathrooms, data=train_data), col="blue", lwd=3, lty=2)
# Plot of original data without outliers. Note the change of slope.
plot(train_data1$bathrooms, train_data1$price, main="Outliers removed", xlab="bathrooms", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ bathrooms, data=train_data1), col="blue", lwd=3, lty=2)
par(mfrow=c(1, 2))
plot(train_data$age, train_data$price, main="With Outliers", xlab="age", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ age, data=train_data), col="blue", lwd=3, lty=2)
# Plot of original data without outliers. Note the change of slope.
plot(train_data1$age, train_data1$price, main="Outliers removed", xlab="age", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ age, data=train_data1), col="blue", lwd=3, lty=2)
par(mfrow=c(1, 2))
plot(train_data$waterfront, train_data$price, main="With Outliers", xlab="waterfront", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ waterfront, data=train_data), col="blue", lwd=3, lty=2)
# Plot of original data without outliers. Note the change of slope.
plot(train_data1$waterfront, train_data1$price, main="Outliers removed", xlab="waterfront", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ waterfront, data=train_data1), col="blue", lwd=3, lty=2)
par(mfrow=c(1, 2))
plot(train_data$sqft_above, train_data$price, main="With Outliers", xlab="sqft_above", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ sqft_above, data=train_data), col="blue", lwd=3, lty=2)
# Plot of original data without outliers. Note the change of slope.
plot(train_data1$sqft_above, train_data1$price, main="Outliers removed", xlab="sqft_above", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ sqft_above, data=train_data1), col="blue", lwd=3, lty=2)
par(mfrow=c(1, 2))
plot(train_data$lat, train_data$price, main="With Outliers", xlab="lat", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ lat, data=train_data), col="blue", lwd=3, lty=2)
# Plot of original data without outliers. Note the change of slope.
plot(train_data1$lat, train_data1$price, main="Outliers removed", xlab="lat", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ lat, data=train_data1), col="blue", lwd=3, lty=2)
par(mfrow=c(1, 2))
plot(train_data$sqft_living15, train_data$price, main="With Outliers", xlab="sqft_living15", ylab="price", pch="o", col="green", cex=2)
abline(lm(price ~ sqft_living15, data=train_data), col="blue", lwd=3, lty=2)
# Plot of original data without outliers. Note the change of slope.
plot(train_data1$sqft_living15, train_data1$price, main="Outliers removed", xlab="sqft_living15", ylab="price", pch="*", col="red", cex=2)
abline(lm(price ~ sqft_living15, data=train_data1), col="blue", lwd=3, lty=2)
From these scatter plots, we conclude that the relationship between price and bedrooms, bathrooms, sqft_living, sqft_above, lat and sqft_living15 is roughly linear. We can also see a change in the slope of the best-fit line after the outliers are removed: the regression line no longer has to compensate for the large errors the outliers introduce.
From the correlation plot, we pick the independent variables that are most positively correlated with price. We will fit a linear regression and measure its accuracy on the training and test data.
linearmodel = lm(price ~ bedrooms+bathrooms+sqft_living+view+grade+sqft_lot+age+floors+waterfront,
data = train_data1)
summary(linearmodel)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + view +
## grade + sqft_lot + age + floors + waterfront, data = train_data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -700039 -93309 -6537 81522 792471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.453e+05 1.191e+04 -54.184 < 2e-16 ***
## bedrooms -1.450e+04 1.573e+03 -9.219 < 2e-16 ***
## bathrooms 3.083e+04 2.538e+03 12.149 < 2e-16 ***
## sqft_living 8.563e+01 2.605e+00 32.869 < 2e-16 ***
## view 2.348e+04 1.821e+03 12.895 < 2e-16 ***
## grade 1.006e+05 1.626e+03 61.913 < 2e-16 ***
## sqft_lot -6.246e-03 2.701e-02 -0.231 0.817156
## age 2.772e+03 4.838e+01 57.295 < 2e-16 ***
## floors 3.769e+04 2.506e+03 15.042 < 2e-16 ***
## waterfront 6.935e+04 2.101e+04 3.301 0.000967 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 137200 on 16336 degrees of freedom
## Multiple R-squared: 0.5666, Adjusted R-squared: 0.5664
## F-statistic: 2373 on 9 and 16336 DF, p-value: < 2.2e-16
Using the above columns, we see that the relationship between price and the chosen independent variables is only moderately strong, as the R-squared value is not very high. Also, the coefficient of sqft_lot is not statistically significant (p ≈ 0.82), so we can drop it.
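Dropping sqft_lot does not require retyping the whole formula; a minimal sketch using `update()`, assuming `linearmodel` from the chunk above is still in scope:

```r
# Refit the same model without sqft_lot; ". ~ . - sqft_lot" keeps
# every other term of the original formula unchanged
linearmodel_reduced <- update(linearmodel, . ~ . - sqft_lot)
summary(linearmodel_reduced)

# F-test comparing the reduced and full fits; given p = 0.82 for
# sqft_lot, R-squared should be essentially unchanged
anova(linearmodel_reduced, linearmodel)
```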
Declaring an observation an outlier based on just one (possibly non-pivotal) feature can lead to unrealistic inferences. Rather than judging whether a single variable is extreme, it is better to consider all the features collectively. Influential outliers of this kind distort the model's predictions and produce unjustified results. We use Cook's distance to find such outliers.
cooksd <- cooks.distance(linearmodel)
Now we plot the Cook's distances.
par(mfrow=c(1, 1))
plot(cooksd, main="Influential Obs by Cooks distance",xlim=c(0,5000),ylim=c(0,0.1))
axis(1, at=seq(0, 5000, 2000))
axis(2, at=seq(0, 0.1, 0.01))
abline(h = 4*mean(cooksd, na.rm=T), col="green")
text(x=1:length(cooksd)+1,y=cooksd,labels=ifelse(cooksd>4*mean(cooksd,na.rm=T),names(cooksd),""), col="red")
We identify the influential points in the data and separate out the influential outliers.
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))]) # influential row numbers
head(train_data[influential, ],10)
influential_data = train_data[influential, ]
influencial_outliers=inner_join(outliers_data,influential_data)
## Joining, by = c("bedrooms", "bathrooms", "sqft_living", "view", "grade", "age",
## "waterfront", "long", "lat", "zipcode", "condition", "sqft_above",
## "sqft_living15", "reno", "price", "sqft_lot", "floors")
train_data2 = rbind( train_data1,influencial_outliers)
We will now fit the model on the outlier-cleaned data, trying a somewhat different set of variables than last time and stopping when we reach the maximum R-squared value.
linearmodel_trn = lm( data = train_data2, price~bedrooms+bathrooms+sqft_living+view+grade+age+waterfront+long+lat+zipcode+condition+sqft_above+sqft_living15+reno)
summary(linearmodel_trn)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + view +
## grade + age + waterfront + long + lat + zipcode + condition +
## sqft_above + sqft_living15 + reno, data = train_data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -567961 -75433 -7472 63376 2051290
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.556e+07 1.927e+06 -8.072 7.41e-16 ***
## bedrooms -1.141e+04 1.369e+03 -8.337 < 2e-16 ***
## bathrooms 3.467e+04 2.183e+03 15.881 < 2e-16 ***
## sqft_living 5.905e+01 3.075e+00 19.203 < 2e-16 ***
## view 3.058e+04 1.616e+03 18.919 < 2e-16 ***
## grade 7.575e+04 1.511e+03 50.124 < 2e-16 ***
## age 1.726e+03 4.922e+01 35.061 < 2e-16 ***
## waterfront 2.006e+05 1.796e+04 11.168 < 2e-16 ***
## long -6.201e+04 8.710e+03 -7.119 1.14e-12 ***
## lat 5.539e+05 7.157e+03 77.391 < 2e-16 ***
## zipcode -1.927e+02 2.250e+01 -8.564 < 2e-16 ***
## condition 2.509e+04 1.603e+03 15.654 < 2e-16 ***
## sqft_above 2.230e+01 2.805e+00 7.952 1.96e-15 ***
## sqft_living15 4.418e+01 2.526e+00 17.491 < 2e-16 ***
## reno 2.719e+04 5.123e+03 5.308 1.12e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 119300 on 16356 degrees of freedom
## Multiple R-squared: 0.6888, Adjusted R-squared: 0.6886
## F-statistic: 2586 on 14 and 16356 DF, p-value: < 2.2e-16
xyplot(resid(linearmodel_trn)~fitted(linearmodel_trn))
histogram(~residuals(linearmodel_trn),width=200000)
library("mosaic")
qqmath(~resid(linearmodel_trn))
ladd(panel.qqmathline(resid(linearmodel_trn)))
From the residuals-vs-fitted plot above, we can see that the residuals are scattered around the zero line, so the constant-variance assumption appears reasonable. Ignoring the outliers, the histogram and the QQ-plot show an approximately normal distribution, which supports the normality assumption required by linear regression. Moreover, we don't have any time-series structure in the data, so we can treat the observations as independent. The sample size is large, so randomization can be considered satisfied. Finally, the moderately high R-squared value in the training model summary suggests that the linearity assumption holds.
Thus, we conclude that the linear regression model is appropriate in this context.
y_pred = linearmodel_trn$fitted.values
y_true = train_data2$price
tally_table=data.frame(actual = y_true, predicted = y_pred)
mean_error = mean(abs(tally_table$actual-tally_table$predicted)/tally_table$actual)
accuracy = 1-mean_error
print(paste("The accuracy of the train data is = ", accuracy, " or = ",round(accuracy*100, digits = 2), "%" ))
## [1] "The accuracy of the train data is = 0.795543730343887 or = 79.55 %"
outliers = boxplot(test_data$price,plot=FALSE)$out
outliers_data =test_data[which(test_data$price %in% outliers),]
test_data1 = test_data[-which(test_data$price %in% outliers),]
linearmodel_tst = lm(price ~ bedrooms+bathrooms+sqft_living+view+grade+sqft_lot+age+floors+waterfront,
data = test_data1)
cooksd <- cooks.distance(linearmodel_tst)
influential <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd, na.rm=T))]) # influential row numbers
influential_data = test_data[influential, ]
influencial_outliers=inner_join(outliers_data,influential_data)
## Joining, by = c("bedrooms", "bathrooms", "sqft_living", "view", "grade", "age",
## "waterfront", "long", "lat", "zipcode", "condition", "sqft_above",
## "sqft_living15", "reno", "price", "sqft_lot", "floors")
test_data2 = rbind( test_data1,influencial_outliers)
linearmodel_tst = lm( data = test_data2, price~bedrooms+bathrooms+sqft_living+view+grade+age+waterfront+long+lat+zipcode+condition+sqft_above+sqft_living15+reno)
summary(linearmodel_tst)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + view +
## grade + age + waterfront + long + lat + zipcode + condition +
## sqft_above + sqft_living15 + reno, data = test_data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -402426 -74274 -7388 62211 1337677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.588e+06 3.701e+06 -1.780 0.07517 .
## bedrooms -9.659e+03 2.227e+03 -4.338 1.48e-05 ***
## bathrooms 2.829e+04 4.344e+03 6.514 8.21e-11 ***
## sqft_living 7.122e+01 5.872e+00 12.128 < 2e-16 ***
## view 4.065e+04 3.072e+03 13.232 < 2e-16 ***
## grade 7.223e+04 2.905e+03 24.868 < 2e-16 ***
## age 1.647e+03 9.739e+01 16.910 < 2e-16 ***
## waterfront 1.051e+05 3.256e+04 3.229 0.00125 **
## long -3.688e+04 1.670e+04 -2.209 0.02724 *
## lat 5.583e+05 1.377e+04 40.552 < 2e-16 ***
## zipcode -2.547e+02 4.312e+01 -5.906 3.79e-09 ***
## condition 2.775e+04 3.082e+03 9.001 < 2e-16 ***
## sqft_above 1.462e+01 5.465e+00 2.675 0.00750 **
## sqft_living15 3.763e+01 4.835e+00 7.783 8.89e-15 ***
## reno 3.248e+04 1.094e+04 2.968 0.00302 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 114900 on 4080 degrees of freedom
## Multiple R-squared: 0.6962, Adjusted R-squared: 0.6952
## F-statistic: 667.9 on 14 and 4080 DF, p-value: < 2.2e-16
xyplot(resid(linearmodel_tst)~fitted(linearmodel_tst))
histogram(~residuals(linearmodel_tst),width=200000)
library("mosaic")
qqmath(~resid(linearmodel_tst))
ladd(panel.qqmathline(resid(linearmodel_tst)))
From the residuals-vs-fitted plot above, we can see that the residuals are scattered around the zero line, so the constant-variance assumption appears reasonable. Ignoring the outliers, the histogram and the QQ-plot show an approximately normal distribution, which supports the normality assumption required by linear regression. Moreover, we don't have any time-series structure in the data, so we can treat the observations as independent. The sample size is large, so randomization can be considered satisfied. Finally, the moderately high R-squared value in the test model summary suggests that the linearity assumption holds.
Thus, we conclude that the linear regression model is appropriate in this context.
y_pred = linearmodel_tst$fitted.values
y_true = test_data2$price
tally_table=data.frame(actual = y_true, predicted = y_pred)
mean_error = mean(abs(tally_table$actual-tally_table$predicted)/tally_table$actual)
accuracy = 1-mean_error
print(paste("The accuracy of the test data is = ", accuracy, " or = ",round(accuracy*100, digits = 2), "%" ))
## [1] "The accuracy of the test data is = 0.798104235768891 or = 79.81 %"
From the results of the tests carried out, we can conclude that the model produces good accuracy on both the training and testing data. The independent variables were selected based on the correlation plot and on the factors that affect housing prices in general. We saw an accuracy of approximately 80% on both our training and testing data, so we can conclude that our model predicts quite well.
However, to get the best possible model, further process improvements are needed, one of which is checking the model for a tendency to overfit or underfit, along with experimenting with other modelling approaches such as regularized or Bayesian regression.
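One way to check for overfitting is k-fold cross-validation with the caret package already loaded above. The following is a minimal sketch, assuming `train_data2` and the final formula from the analysis are in scope; the cross-validated RMSE and R-squared can then be compared against the in-sample fit:

```r
library(caret)  # already attached earlier in this report

# 10-fold cross-validation of the final linear model
set.seed(123)
cv_ctrl <- trainControl(method = "cv", number = 10)
cv_model <- train(
  price ~ bedrooms + bathrooms + sqft_living + view + grade + age +
    waterfront + long + lat + zipcode + condition + sqft_above +
    sqft_living15 + reno,
  data = train_data2,
  method = "lm",
  trControl = cv_ctrl
)

# Held-out performance averaged over the 10 folds
print(cv_model$results[, c("RMSE", "Rsquared", "MAE")])
```

If the cross-validated R-squared is close to the training R-squared, overfitting is unlikely to be a major concern for this model.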